Implementing a BNC-Compare-able Web Corpus
Author
Abstract
This paper details the author’s plans for, and progress in, compiling and analyzing a new gigaword English corpus from the web to complement his BNC-based online database “Phrases in English”. The new corpus represents the principal English-speaking countries in proportion to their populations and will be linguistically annotated with the CLAWS4 tagger, using a PoS tagset comparable to those of the BNC and ANC. Parallel processing on multiple PCs will facilitate reaching the targeted size. The corpus will continue to grow dynamically in response to actual user queries to the author’s various web-as-corpus interfaces, but “snapshots” of each generation of the corpus will be preserved to ensure replicability of results. This report on work in progress is intended to inspire discussion of the underlying concepts and suggestions for improvement.
Similar resources
The Creation of a Spoken Sub-Corpus from the British National Corpus for Comparative Purposes
The British National Corpus (henceforth BNC) is one of the most frequently consulted corpora in linguistic research. While the use of this corpus is continuously on the increase, it appears that most BNC-related research work has exploited the corpus in its entirety, i.e. taking the corpus as a whole in analysing specific features or comparing with a different reference corpus. Despite the fact...
Comparing Knowledge Sources for Nominal Anaphora Resolution
We compare two ways of obtaining lexical knowledge for antecedent selection in other-anaphora and definite noun phrase coreference. Specifically, we compare an algorithm that relies on links encoded in the manually created lexical hierarchy WordNet and an algorithm that mines corpora by means of shallow lexico-semantic patterns. As corpora we use the British National Corpus (BNC), as well as th...
Web as Corpus
The corpus resource for the 1990s was the BNC. Conceived in the 80s, completed in the mid 90s, it was hugely innovative and opened up myriad new research avenues for comparing different text types, sociolinguistics, empirical NLP, language teaching and lexicography. But now the web is with us, giving access to colossal quantities of text, of any number of varieties, at the click of a button, fo...
Corpus Linguistics with BNCweb - a Practical Guide
Book synopsis: This book presents a richly illustrated, hands-on discussion of one of the fastest growing fields in linguistics today. The authors address key methodological issues in corpus linguistics, such as collocations, keywords and the categorization of concordance lines. They show how these topics can be explored step-by-step with BNCweb, a user-friendly web-based tool that supports soph...
Clarifying the Concepts and Navigating a Path through the BNC Jungle
In this paper, an attempt is first made to clarify and tease apart the somewhat confusing terms genre, register, text type, domain, sublanguage, and style. The use of these terms by various linguists and literary theorists working under different traditions or orientations will be examined and a possible way of synthesising their insights will be proposed and illustrated with reference to the d...
Publication date: 2007